Altogether: Image Captioning via Re-aligning Alt-text
Hu Xu, Po-Yao Huang, Xiaoqing Ellen Tan, Ching-Feng Yeh, Jacob Kahn, Christine Jou, Gargi Ghosh, Omer Levy, Luke Zettlemoyer, Wen-tau Yih, Shang-Wen Li, Saining Xie, Christoph Feichtenhofer
This paper focuses on creating synthetic data to improve the quality of image captions. Existing works typically have two shortcomings: first, they caption images from scratch, ignoring existing alt-text metadata; second, they lack transparency when the captioner's training data (e.g., GPT) is unknown. In this paper, we study Altogether, a principled approach based on the key idea of editing and re-aligning the alt-texts already associated with images. To generate training data, we perform human annotation in which annotators start from the existing alt-text and re-align it to the image content over multiple rounds, consequently constructing captions with rich visual concepts. This differs from prior work that treats human annotation as a one-time description task based solely on the images and the annotators' knowledge. We then train a captioner on this data to generalize the re-alignment process at scale. Our results show that Altogether leads to richer image captions that also improve text-to-image generation and zero-shot image classification.
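The re-alignment loop the abstract describes can be pictured as iterative conditioned generation. Below is a minimal sketch under assumed interfaces: `caption_model` and its `generate()` method are hypothetical stand-ins, since the paper's actual captioner, prompts, and round count are not given here.

```python
# A hedged sketch of alt-text re-alignment; caption_model is a hypothetical
# multimodal captioner, not the paper's actual model.

def realign_alt_text(image, alt_text, caption_model, rounds=3):
    """Iteratively edit an existing alt-text toward the image content."""
    caption = alt_text  # start from the noisy metadata, not from scratch
    for _ in range(rounds):
        prompt = (
            "Edit this alt-text so it accurately and completely describes "
            f"the image, keeping details that are already correct:\n{caption}"
        )
        caption = caption_model.generate(image=image, prompt=prompt)
    return caption
```

The human annotation process in the paper has the same shape: annotators perform the per-round edits that the trained captioner later imitates at scale.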
- Oceania > Australia > Tasmania (0.04)
- North America > United States > North Carolina (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)
Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions
Moran Yanuka, Assaf Ben-Kish, Yonatan Bitton, Idan Szpektor, Raja Giryes
Recent research increasingly focuses on training vision-language models (VLMs) with long, detailed image captions. However, small-scale VLMs often struggle to balance the richness of these captions with the risk of hallucinating content during fine-tuning. In this paper, we explore how well VLMs adapt to such captions. To quantify caption quality, we propose Decomposed NLI (DNLI), an evaluation framework that breaks down generated captions into individual propositions, assessing each in isolation. This fine-grained analysis reveals a critical balance between capturing descriptive details and preventing hallucinations. Our findings show that simply reducing caption complexity or employing standard data curation techniques does not effectively resolve this issue. To tackle this challenge, we introduce Knowledge Adapted (KnowAda) fine-tuning, a data-centric approach that automatically adapts training data with the model's existing knowledge and visual understanding. KnowAda minimizes hallucinations while preserving high descriptiveness. We validate this approach across several small-scale VLMs (up to 7B parameters) and dense caption datasets, demonstrating that KnowAda effectively balances hallucination reduction and descriptiveness. Our results show that KnowAda outperforms various baselines in both automatic metrics and human evaluations. We will release our code and models.
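To make the DNLI idea concrete, here is a hedged sketch that decomposes a caption into propositions and scores each one in isolation against a reference description, assuming an off-the-shelf MNLI model as the entailment judge. The paper's actual decomposition and scoring setup may differ, and `split_into_propositions` is a naive placeholder.

```python
from transformers import pipeline

# Off-the-shelf NLI judge; the paper's exact entailment model may differ.
nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def split_into_propositions(caption):
    # Naive placeholder: treat each sentence as one proposition. The paper's
    # decomposition step is more sophisticated than sentence splitting.
    return [s.strip() for s in caption.split(".") if s.strip()]

def dnli_score(reference, generated_caption):
    """Fraction of the caption's propositions entailed by the reference."""
    propositions = split_into_propositions(generated_caption)
    entailed = 0
    for prop in propositions:
        # Score each proposition in isolation against the reference text.
        result = nli({"text": reference, "text_pair": prop})
        if isinstance(result, list):  # output shape varies across versions
            result = result[0]
        entailed += result["label"] == "ENTAILMENT"
    return entailed / max(len(propositions), 1)
```

A caption that is highly descriptive but scores low here is hallucinating propositions; the balance the paper studies is visible directly in that trade-off.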
- North America > United States > Nevada > Clark County > Las Vegas (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
CAST: Cross-modal Alignment Similarity Test for Vision Language Models
Dagan, Gautier, Loginova, Olga, Batra, Anil
Vision Language Models (VLMs) are typically evaluated with Visual Question Answering (VQA) tasks, which assess a model's understanding of scenes. Good VQA performance is taken as evidence that the model will perform well on a broader range of tasks that require both visual and language inputs. However, scene-aware VQA does not fully capture input biases or assess hallucinations caused by misalignment between modalities. To address this, we propose the Cross-modal Alignment Similarity Test (CAST), which probes VLMs for self-consistency across modalities. The test asks a model to identify similarities between two scenes through text-only, image-only, or combined inputs, and then to assess the truthfulness of the similarities it generates. Since there is no ground truth to compare against, this evaluation does not focus on objective accuracy but on whether VLMs are internally consistent in their outputs. We argue that while not all self-consistent models are capable or accurate, all capable VLMs must be self-consistent.
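The test procedure can be sketched as two rounds of querying the same model. The `vlm` object and its `ask()` method below are hypothetical stand-ins; CAST's real prompting and truthfulness-assessment protocol are more involved than this sketch.

```python
# A hedged sketch of a CAST-style self-consistency probe; the vlm interface
# is assumed, not CAST's actual implementation.

def cast_probe(vlm, image_a, image_b, desc_a, desc_b):
    prompt = "List the similarities between these two scenes."

    # Elicit similarities through each input modality.
    similarities = {
        "text": vlm.ask(prompt, texts=[desc_a, desc_b]),
        "image": vlm.ask(prompt, images=[image_a, image_b]),
        "both": vlm.ask(prompt, images=[image_a, image_b],
                        texts=[desc_a, desc_b]),
    }

    # Have the same model judge the truthfulness of each claimed similarity;
    # a self-consistent model should accept its own outputs regardless of
    # which modality produced them.
    judgements = {
        source: vlm.ask(
            f"Are these similarities true of the two scenes? {claims}",
            images=[image_a, image_b],
        )
        for source, claims in similarities.items()
    }
    return similarities, judgements
```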
- North America > United States > California (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)
- Europe > United Kingdom > Scotland (0.04)
- Asia > Singapore (0.04)
Are smart bulbs worth the money? 10 things you need to know
Whether you're arriving home late at night or just need some extra help around the house, adding smart light bulbs to your living space can help you find your way around without fumbling in the dark for a traditional light switch. Smart light bulbs are just like regular light bulbs except that they can be controlled remotely by voice, Bluetooth, or mobile apps. If you're teetering on the edge of setting up your house with smart home products, light bulbs are a great place to start. However, before you outfit your house with smart light bulbs, there are some things you should know. Installation is easy: just remove your current bulbs and screw in the new smart bulbs.
NAB 2018: The year of Artificial Intelligence - Screen Africa
Attendance at NAB 2018 was down on previous years, but despite the lower visitor numbers, the conference content and exhibition produced the buzz and excitement that NAB is synonymous with. This year's show had a little bit of everything, but the main trends revolved around RGB lighting, large-format cameras, and a game-changing codec, while the conference sessions followed some interesting threads under the umbrella of next-generation technologies, namely artificial intelligence (AI), immersive media and cyber security. From production to distribution, artificial intelligence has taken the broadcast and filmmaking industries by storm. The 2018 edition of NAB dedicated time and space to showcasing developments in AI, with conference sessions such as "Machine Intelligence: The Evolution of Content Production Aided by Machine Learning", "Optimising Production with Neural Networks", "How Machine Intelligence is Transforming Editorial", "New Frontiers in Animation and Computer Graphics", "From Dailies to Master – Machine Intelligence Comes to Video Workflows" and, finally, "The Future of Content with Machine Intelligence". The series of sessions looked at machine learning, deep learning and artificial intelligence technologies, and at how studios, networks and creative service companies can use them to produce content.
- Africa (0.40)
- North America > United States > Nevada > Clark County > Las Vegas (0.05)
- Leisure & Entertainment (0.97)
- Information Technology > Security & Privacy (0.72)
- Media > Film (0.48)
Never touch a switch again with these 10 smart lighting options
Add 'smart lighting' to your home and never touch a switch again (Photo: Philips Hue). The average light fixture can be pretty boring. Sometimes it has buttons, a toggle, or dimming features. Sometimes the wall plate comes in different colors.